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ABSTRACT 



Compressed data objects are stored in a remote data store 
and are accessible to end users for data retrieval and pro- 
cessing. Descriptive information about data objects and 
processing logic is displayed and made available for viewing 
and access by end users having a client application which 
manipulates the descriptive information to create a custom- 
ized data retrieval and processing request. The request can 
be submitted to several data stores in succession for auto- 
mated data retrieval and processing without jeopardizing 
request security. The final request is submitted to the appro- 
priate data store(s) and/or vendor(s) for automated process- 
ing in accordance with the customized instructions, and the 
data result set is returned in a format specified by the end 
user. Householding capability and security features are 
available. End users can customize their householding 
parameters, including end user specifications to improve 
flexibility, speed, and efficiency of the householding pro- 
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METHOD AND APPARATUS FOR STORING 

DATA AS OBJECTS, CONSTRUCTING 
CUSTOMIZED DATA RETRIEVAL AND DATA 

PROCESSING REQUESTS, AND 
PERFORMING HOUSEHOLDING QUERIES 

This application claims the benefit of U.S. provisional 
patent application No. 60/133,675, filed May 11, 1999. 

FIELD OF THE INVENTION 

The invention relates to the field of databases, in 
particular, to data storage and data compression, and to 
technology for searching and retrieving data from a data 
store. The invention also relates to performing householding 
queries on end user data. 

BACKGROUND OF THE INVENTION 

There are many existing methods for constructing data- 
base queries, searching databases remotely, and retrieving 
search results. For example, U.S. Pat. No. 5,857,197 
(Mullins Jan. 5, 1999) entitled "SYSTEM AND METHOD 
FOR ACCESSING DATA STORES AS OBJECTS" 
describes, according to the Abstract, a system and a method 
for accessing a data store as objects from an object appli- 
cation. The accessed data store can be either an object data 
store or a non-object (e.g. relational) data store. The system 
includes an object schema including meta data correspond- 
ing to a data store schema and an adapter abstraction layer. 
The adapter abstraction layer comprises a first adapter, and 
a second adapter. One embodiment of the system includes an 
object schema manager to create and maintain the object 
schema at run time. It comprises a dynamic, scalable, 
centrally managed, and secure method for accessing data 
stored in both object and non-object (e.g. relational) data 
stores, effecting a consistent interface to the data store 
regardless of its underlying structure, or a method of trans- 
port and level of security. 

U.S. Pat. No. 5,787,411 (Groff et al July 28, 1998) 
entitled "METHOD AND APPARATUS FOR DATABASE 
FILTER GENERATION BY DISPLAY SELECTION" 
describes, according to the Abstract, a method for selecting 
records from a displayed database table by generating an 
SQL SELECT command for filtering the displayed records 
in accordance with cell values highlighted by user input. A 
presently selected set of records from a desired table 
(generally referred to as a record set or record source) are 
displayed on the user's display screen. The user selects 
particular values in cells (an intersection of a row and a 
column of the displayed table) by highlighting the values 
using the pointer device or keyboard of the computer 
system. Methods of the present invention then generate an 
SQL select (filter) command to selectively retrieve those 
records from the displayed records which match the user's 
highlighted values. A fully highlighted cell indicates exact 
equality is desired by the user, a beginning portion high- 
lighted indicates that the user wishes to match records whose 
corresponding column starts with the highlighted value, and 
ending portion selection matches the ending portion of 
qualified records, and a middle portion highlighted matches 
any record containing the highlighted value. Values high- 
lighted in the same row generate logically AND'd clauses in 
the SELECT command while the comparison generated for 
a row are logically OR'd with the comparisons generated for 
other rows. The user may indicate that the highlighted values 
are for selection (inclusion of qualified records) or for 
exclusion selection (exclusion of qualified records). A new 
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select (filter) command may be logically AND'd with the 
prior filter to permit complex selection criteria to be defined 
by simple graphical user inputs. 

U.S. Pat. No. 5,864,844 (James et al. Jan. 26, 1999) 

s entitled "SYSTEM AND METHOD FOR ENHANCING A 
USER INTERFACE WITH A COMPUTER BASED 
TRAINING TOOL" describes, according to the Abstract, a 
method for enhancing a user interface with a computer based 
training tool comprising the steps of listing domain objects 

10 on a display; listing domain object values in response to a 
selection of one of the domain objects; generating a plurality 
of inquiries in response to a user selection of one of the 
domain object values; replying with a predetermined 
answer; identifying a new domain object value in the 
predetermined answer; and adding to the plurality of inquir- 

15 ies a new inquiry which incorporates both the selected 
domain object value and the new domain object value. 

U.S. Pat. No. 5,787,412 (Bosch et al. Jul. 28, 1998) 
entitled "OBJECT ORIENTED DATA ACCESS AND 
ANALYSIS SYSTEM" describes, according to the Abstract, 

20 a system for accessing and analyzing data through a central 
processing unit. The system includes a non-modal user 
interface to provide a user access to the system. A number 
of application graphics objects allow the user to visually 
interact with a plurality of analysis objects through the 

25 non-modal user interface. The plurality of application analy- 
sis objects allow a user to interactively create an analysis 
network for analyzing one or more databases. A plurality of 
application data access objects automatically interprets the 
analysis network and allows the system to access required 

30 databases and to generate structure query language required 
to access and analyze the databases as defined within the 
analysis network. 

U.S. Pat. No. 5,787,425 (Bigus Jul. 28, 1998) entitled 
"OBJECT-ORIENTED DATA MINING FRAMEWORK 

35 MECHANISM" describes, according to the Abstract, an 
object oriented framework for data mining operates upon a 
selected data source and produces a result file. Certain core 
functions arc performed by the framework, which interact 
with the extensible function. This separation of core and 

40 extensible functions allows the separation of the specific 
processing sequence and requirement of a specific data 
mining operation from the common attribute of all data 
mining operations. The user may thus define extensible 
functions that allow the framework to perform new data 

45 mining operations without the framework having the knowl- 
edge of the specific processing required by those operations. 

U.S. Pat. No. 5,761,663 (Lagarde et al. Jun. 2, 1998) 
entitled "METHOD FOR DISTRIBUTED TASK FUL- 
FILLMENT OF WEB BROWSER REQUESTS" describes, 

50 according to the Abstract, a World Wide Web browser which 
makes requests to web servers on a network which receive 
and fulfill requests as an agent of the browser client, 
organizing distributed sub -agents as distributed integration 
solution (DIS) servers on an intranet network supporting the 

55 web server which also has an access agent servers accessible 
over the Internet. DIS servers execute selected capsule 
objects which perform programmable functions upon a 
received command from a web server control program agent 
for retrieving, from a database gateway coupled to a plural - 

60 ity of database resources upon a single request made from a 
Hypertext document, requested information from multiple 
data bases located at different types of databases geographi- 
cally dispersed, performing calculations, formatting, and 
other services prior to reporting to the web browser or to 

65 other locations, in a selected format, as in a display, fax, 
printer, and to customer installations or to TV video 
subscribers, with account tracking. 
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However, there is a need for a substantial improvement in 
the efficiency of several data-intensive industries. For 
example, the credit industry currently requires an average of 
at least thirty days to respond to a customer request for 
information necessary to execute a pre-approved credit card 
mailing. Typically, a customer wishing to execute such a 
mailing will send the credit company a sample of the types 
of households to which the customer would like to send its 
mailing, which sample includes records containing specific 
addresses, account numbers, household incomes, and other 
household information. Because of the size of the sample, 
the records are generally sent on magnetic tape, which, as is 
well known, entails handling, transportation and storage 
costs and delays inherent in physical transportation. The 
customer also typically sends specifications for supplement- 
ing this sample data with data from the credit company data 
store in order to compile a complete mailing list for the offer. 

The specifications might request that the data store iden- 
tify 20,000 households in the same zip codes and income 
ranges as the households provided in the sample. The 
specifications might also request that once these households 
are identified, the data store run its own logic on this data to 
determine the creditworthiness of each household contained 
therein (a "FICA score") and return only those households 
with a FICA score of an acceptable level. 

Finally, the specifications will specify the layout in which 
the customer would like to receive the finished report. 
Customer specifications may be significantly more complex 
than this example which has been simplified for the purposes 
of explanation. 

Upon receiving the request, the credit company must 
allocate resources to analyze customer specifications and 
ensure that it correctly understands the customer's needs and 
that the data that has been provided. Miscommunication 
between the credit company and the customer can lead to 
costly errors and reruns, particularly as requests become 
more sophisticated. 

Once the credit company believes that it understands the 
specifications, it will often be required to write custom 
software code in order to standardize the records provided 
by the customer (which may contain a variety of record 
formats and types) so that they will interface with the data 
store's standard processing operation. Preparing custom 
code for each customer requires large numbers of program- 
mers and time consuming data analysis, as can readily be 
appreciated. 

Upon completion of the analysis, the credit company will 
run its proprietary FICA score logic on the results and may 
run "house holding" logic to show relationships among dif- 
ferent individuals sharing the same address, or to identify 
and eliminate redundant information. Large portions of the 
householding process are generally "hard coded" in 
advance, making it difficult to adjust householding parsing 
and matching parameters to the particular data received or 
the end user's particular needs. As a result, data may be 
mismatched and relationships among records may be undis- 
covered. Furthermore, householding large files often takes 
days or weeks to complete because large files and cumber- 
some parsing and matching procedures slow the process and 
prevent the efficient use of system memory. 

Once processing is completed, custom code is then writ- 
ten to provide the results in the customer's preferred layout. 
Upon completion of the process, the credit company will 
likely return the results on magnetic tape because the 
volume of the data prohibits electronic transmission. This 
entire process may take one month or more, excluding 
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reruns if there are miscommunications or mistakes at any 
stage of the process. 

Therefore, a need exists for improvements in the area of 
database construction, queries, householding, and remote 
s searching. 

SUMMARY OF THE INVENTION 

It is, therefore, a principle object of this invention to 

10 provide a method and apparatus for storing data as objects, 
constructing customized data retrieval and data processing 
requests, and performing householding queries. 

It is another object of the invention to provide a method 
and apparatus that solves the above mentioned problems so 

15 that database efficiency can be improved. 

These and other objects of the present invention are 
accomplished by the method and apparatus disclosed herein. 

Advantageously, according to an aspect of the invention, 
a universal data object (UDO) is provided. A universal data 

20 object is a combination of data and function/logical pro- 
cessing instructions used to perform a particular database 
search request and to execute customized processing instruc- 
tions on the results of that request. Each universal data 
object consists of two distinct sections: (i) one or more data 

25 object headers and (ii) raw data. The universal data object is 
compressed at practically all stages of operation. The uni- 
versal data object may include, but is not limited to, 
extended binary coded decimal interchange code 
(EBCDIC— an IBM® system 360/370, 256 character code 

30 with 11 bits per character) and variable length files put in a 
standardized format, and an index of the compressed data. 
The invention provides for automated extraction of data 
object headers and capture of data profile statistics. 

According to an aspect of the invention, the user builds a 

35 fully executable processing request, not mere specifications 
for a request. 

According to an aspect of the invention, a universal data 
object carries large amounts of data married to processing 
4Q instructions. 

According to one aspect of the invention, the universal 
data object remains compressed at nearly all times. 

According to an aspect of the invention, compressed 
universal data objects are stored in a flexible data store and 
45 are accessible to remote end users for data retrieval and 
processing. 

According to an aspect of the invention, during 
compression, the data store is optimized for efficient data 
storage and retrieval by capturing data profile statistics and 
50 creating an index of the compressed data. 

According to an aspect of the invention, descriptive 
information about universal data objects and processing 
logic is displayed and made available for viewing and access 
by remote end users having access to a client application. 

According to an aspect of the invention, the client appli- 
cation manipulates the descriptive information to create a 
customized data retrieval and processing request. 

According to an aspect of the invention, an end user can 
60 add descriptive information about its own data to the request 
and attach its own compressed data to the request so that it 
will be included in data retrieval and processing. 

According to an aspect of the invention, descriptive 
information representing data and logic stored in multiple 
65 remote data stores can be integrated into one request. 

According to an aspect of the invention, the request can 
be submitted to several data stores in succession for auto- 
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mated data retrieval and processing without jeopardizing 
request security. 

According to an aspect of the invention, the final, cus- 
tomized request object, therefore, may contain: (i) descrip- 
tive information identifying data; (ii) descriptive informa- 
tion identifying logic to be executed on the data; and (iii) 
compressed end user data. 

According to an aspect of the invention, the final request 
is submitted to the appropriate data store(s) and/or vendor(s) 
for automated processing in accordance with the customized 
instructions, and the data result set is returned in a format 
specified by the end user. 

According to an aspect of the invention, the contents of 
the data store and the request can be almost completely 
processed entirely in their compressed form. 

According to an aspect of the invention, a method of 
performing house holding queries comparing input data with 
reference lists to produce data output is described, the 
method comprising building input patterns from data input 
submitted by an end user, displaying the input data for 
viewing by an end user utilizing a graphical user interface 
(GUI), allowing the end user to construct an output table 
containing output table symbols, mapping the input patterns 
to the output table symbols, mapping the output table 
symbols to a reference list, generating search patterns 
derived from the mapping of input patterns to the output 
table symbols that are mapped to the reference list, and using 
the search patterns to parse the data input against the 
reference lists to produce data output. 

According to another aspect of the invention, end users 
have broad flexibility in customizing their householding 
parameters, including end -user specifications to improve the 
flexibility, speed and efficiency of the householding process. 
End user householding specifications are entered through the 
invention's GUI, which allows end users to review summary 
information about their data and specify the data parsing 
(i.e., data cleansing and classifying) procedures that they 
wish the invention to perform. End users can access and use 
the GUI to tailor householding parameters to accommodate 
their parsing requirements. For example, an end user can 
tailor parsing parameters to ensure that non-traditional data 
inputs, such as data from different countries or different 
systems, are correctly parsed. 

According to another aspect of the invention, household- 
ing speed and efficiency are improved by using the infor- 
mation entered by the end user to prioritize parsing opera- 
tions and eliminate unnecessary processing. Specialized, 
indexed parsing tables are created in memory to further 
increase the speed of the parsing process. These parsing 
tables are designed to maximize use of system memory, thus 
eliminating extensive I/O exchanges that slow traditional 
householding processes. 

Advantageously, the present invention automates the 
existing process to a degree that was previously impossible. 
Data stored in the credit company data store, such as was 
described above, and is transformed into standardized, com- 
pressed data objects according to an aspect of the invention. 

According to an aspect of the invention, the universal data 
object is optimized for data storage and retrieval with 
methods for automatically capturing and displaying data 
profile statistics and storing data in indexed, compressed 
blocks that facilitate rapid data retrieval. 

According to an aspect of the invention, the credit com- 
pany data store is provided with a method for creating an 
electronic display of its data, data profile statistics, and data 
processing logic, which display is made available to cus- 
tomers. 
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According to an aspect of the invention, a customer 
accesses this displayed information and uses a GUI that 
generates an application programming language ("APL") to 
construct an executable request for data retrieval and data 

S processing from this displayed information. 

According to an aspect of the invention, information 
about the customer's data can be integrated into the request, 
and the customer's own compressed data can be attached to 
the request so that it will be automatically processed at the 

10 data store. 

According to an aspect of the invention, the request 
(including customer data, if any) is submitted to the data 
store electronically, and the request is executed in accor- 
dance with the APL instructions. 

According to an aspect of the invention, once the cus- 
tomer builds and sends the request, the entire process is 
automated and performed on compressed data, with almost 
no expanding of data at any time during the process, unless 

2Q it is passed to an external C program, householding, or other 
program that requires uncompressed data. 

According to an aspect of the invention, a compressed 
result set is automatically returned to the customer in a 
layout specified by the customer on the GUI. 

25 Thus, the invention dramatically improves the efficiency 
of the current process. Advantageously,because the cus- 
tomer can review the entire range of data and logic options 
made available by the data store and construct its own 
request, the data store does not need to interpret customer 

30 specifications or take responsibility for miscommunications. 
Instead, the customer can visit the credit company's Internet 
site and choose the household zip codes and income cat- 
egories necessary for its request by double-clicking on the 
credit company display. 

35 The customer can also double-click on available logic, 
such as FICA score, and use the GUI to specify that this 
logic will be executed on the result set of the data request. 

The customer has the option to specify that its own 
compressed data should be included in the request for 

40 processing or to be householded with data store data to 
identify relationships and eliminate redundant information. 

Once the executable request is completed, the request is 
electronically transmitted to the data store and the entire 
process is automated. Because the universal data objects and 

45 the customer data are already maintained in a standardized 
object format (e.g., with data object headers), the APL can 
read and manipulate specified data and logic without addi- 
tional intervention, and the customer data will be compatible 
with the data store data for householding or other purposes. 

The data is returned in a layout specified by the end user 
on the GUI. Thus, advantageously, data object standardiza- 
tion and layout specification eliminate the need for program- 
mers to write code to submit customer data to the credit 

5S companys standard process and to prepare the report layout. 
Unlike the current process, the process according to an 
exemplary embodiment of the invention, is performed with 
almost no expanding of data at any stage (unless it is passed 
to an external program that requires uncompressed data), 

60 thus saving valuable storage space. Advantageously, the 
indexing of the data store data enhances the efficiency of 
data retrieval and reduces the run time for the entire process. 

By contrast, ordinary data stores cannot implement a 
similar process with currently-available technology. 

65 The efficiencies described herein are achieved according 
to an embodiment of the invention, with an integrated 
system comprised of (i) compressed, universal data objects 



04/19/2004, EAST Version: 1.4.1 



US 6,708, 

7 

(at the data store and optionally at the customer) and (ii) a 
consistent APL to access and process data display informa- 
tion and universal data objects at both the data store and the 
end user site; and (iii) an efficient compression method that 
can process customer and data store data with almost no 5 
uncompressing unless data is passed out of the system or 
householded according to the end user's instructions. 

Advantageously, the integration of data objects, data 
displays, the APL and the compression method into a 
coherent system make it possible to automate data exchange, 10 
data retrieval and data processing to an extent previously 
impossible. 

In contrast with current products, the inventions allows 
the end user to control householding specifications directly 
and to tailor householding parameters according to the end 15 
user's data input and desired output. Through use of a GUI, 
the end user can manipulate data input and parsing and 
matching criteria to produce a highly tailored, householded 
output, without intervention by a third party and without 
submitting data to a standardized "black box" process. 20 

The invention improves the speed and efficiency of the 
householding process by using the end user's householding 
specifications to prioritize parsing operations and eliminate 
unnecessary processing. The invention also uses the end 
user's specifications to reduce the amount of data against 25 
which the end user's data is parsed by creating smaller 
parsing tables with indices that allow rapid data lookup and 
retrieval. The invention maximizes the use of system 
memory, thus eliminating extensive I/O exchanges that slow 
traditional householding processes. 30 

It should be kept in mind that the process set forth above 
is illustrative of one exemplary application of the invention, 
but there are many other possible applications. In general, 
the invention has the potential to generate efficiencies wher- 
ever end users access remote data stores or seek to house- 
hold different records. 

As noted above, so long as data stores store their data in 
the prescribed object form, it is possible to submit a request 
to multiple data stores, so that data might be obtained from ^ 
one data store and automatically submitted to processing 
logic obtained at another data store, or a request for data 
retrieval might be constructed using information from sev- 
eral data stores and then sent seriatim to each data store to 
retrieve all relevant information and return the results to the ^ 
end user. 

These and other aspects of the invention will become 
apparent from the detailed description set forth below. 

DESCRIPTION OF THE DRAWINGS 
FIG. 1 shows the process of creating a data object from 50 
input data. 

FIG. 2 shows the process of creating an executable 
request pertaining to the data object. 

FIG. 3 shows the process of executing the request to ^ 
provide a resultant data object. 

FIG. 4 shows additional details of the process of creqting 
the data object from input data. 

FIG. 5 shows the process of forming a data object header 
for the data object and providing that data object upon 60 
request to the user. 

FIG. 6 shows the process of generating search patterns 
from the input data. 

FIG. 7 shows the process of generating parsing and 
execution tables for conducting searches of the input data 55 

FIG. 8 shows the process of performing searches upon the 
input data to find matches to the parsing table. 



166 Bl 

8 

FIG. 9 shows the process of finding the frequency of 
matches of input data to the parsing table according to 
primary and secondary search positions. 

FIG. 10 shows the process of performing a search to find 
a match of input data to the parsing table based on deviation 
permitted by fault tolerance logic. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT(S) 

The invention will now be described in more detail by 
way of example with reference to various exemplary 
embodiments. It should be kept in mind that the following 
described embodiment(s) is/are only presented by way of 
example and should not be construed as limiting the inven- 
tive concept to any particular physical configuration. 

The invention concerns a method and apparatus for con- 
structing and executing customized data retrieval and data 
processing requests for compressed universal data objects 
stored in a data store. The data store described herein refers 
to either a local or remote data store, or both. The data store 
is optimized for effective data storage and retrieval in order 
to facilitate such requests. 

In a preferred embodiment of the invention, the data store 
contains indexed, compressed data objects, and all of the 
operations described herein are performed with almost no 
uncompressing of the data at any stage, except when data is 
moved into memory for householding, as described herein. 
The ability to create and manage such a data store of 
compressed universal data objects can be purchased com- 
mercially from Norbert Technologies, LLC under the trade 
name CorWorks®. Requests can be executed on a wide 
variety of data types and formats, including ASCII 
(American Standard Code for Information Interchange) 
data, EBCDIC (extended binary coded decimal interchange 
code) data, variable length files and data stored in data base 
extract files and on personal computers. 

The invention enables the electronic display of informa- 
tion about the contents of a data store and logic made 
available by such a data store. A client application in remote 
communication with the data store allows an end user to 
view, access and manipulate this data store information 
electronically to build a customized request for data retrieval 
and data processing. 

Optionally, the end user may include its own compressed 
data with the request so that automated processing can be 
performed on end user data in addition to the data store data. 
Upon submission of the completed request, the request is 
automatically executed at the remote data store and the 
results are returned in a format specified by the end user. A 
preferred embodiment of the invention will now be 
described in more detail. 

The invention supports the use of householding. The GUI 
permits an end user to design a highly specific and tailored 
householding request. After an end user specifies its house- 
holding requirements, the system uses this information to 
increase the speed and efficiency of the householding pro- 
cess by (i) tailoring the parsing table based upon the end 
user's request and moving the tailored parsing table into 
memory to reduce time-intensive I/O processing; (ii) creat- 
ing a series of execution tables to rapidly eliminate bad data 
and avoid searches that will not match the data to the parsing 
table; and (iii) indexing the parsing table to facilitate record 
lookup. 

DATA OBJECT CONSTRUCTION 
An exemplary embodiment of the invention is imple- 
mented in connection with a data store comprised of com- 
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pressed data objects. Each data object consists of two 
distinct sections: (i) a data object header and (ii) raw data. 
The data object header contains information that identifies 
and describes the raw data contained in the data portion of 
the data object. 5 

Data object header information may include, for example, 
the type and length of various record data fields, the date 
each data record was created, the key field that identifies 
each individual record, the record file type, and other 
information about the raw data contained in the data object. 10 

Individuals skilled in the art will recognize from the 
process described below that a wide variety of additional 
information can be included in the data object header within 
the spirit and scope of the invention. 

A "Merge" program, described below, creates the data 
object and maps the data object header information to the 
raw data as noted at header operation 502 of FIG. 5. Because 
the raw data is mapped to the data object header by the 
merge, the raw data is advantageously not stored in a ^ 
physical table or other rigid schema. 

The compressed data objects contained in the data store 
are created by executing a Merge program against selected 
data files as mentioned above. The Merge program can be 
run against various types of files from a wide variety of 25 
sources, including American Standard Code for Information 
Interchange (ASCII) files, extended binary coded decimal 
interchange code (EBCDIC) files, database extract files, and 
file types stored on personal computers, including files 
stored, for example, in Microsoft Access® and Microsoft 3Q 
Excel®. As discussed in more detail below, Merge can read 
and compress variable length records and complex data 
types. 

Before executing Merge, an end user uses a client appli- 
cation installed on a personal computer to select and prepare 35 
files for compression and storage in data object form. In a 
preferred embodiment, the client application includes a 
point-and-click graphical user interface (GUI) that controls 
an application programming language (APL). The APL is 
used to select files and prepare files for submission to Merge. 40 
A personal computer supporting the client application 
should have, for example, at least the following minimum 
specifications: a 233 Mhz processor; 16 megabytes of RAM; 
50 megabytes of available hard disk; a monitor; and a 
mouse. 45 

An operation according to an exemplary embodiment will 
now be described (See FIGS. 1, 4, and 5) Using the GUI, the 
user identifies files for conversion to data object form by 
double -clicking on the desired files, for example, as noted at 
selection operation 402 of FIG. 4. The GUI then generates 50 
a series of screens to prompt the user to enter information 
that will eventually be stored in the data object header, as 
noted at input operation 102 of FIG. 1, and further at entry 
operation 404 of FIG. 4. 

As a first step, the GUI requests a Source I.D., which will 55 
be attached to the data file and retained in the header when 
that file is converted to data object form. A Source I.D. 
identifies by source the types of records contained in the file. 
For example, a file with records originating in a company's 
checking account system might be given a Source I.D, of 60 
"001," while a file with records originating in a company's 
credit card system might be given a Source I.D. of "002." 

After choosing a Source I.D., the user designates a key 
field for each file, which is a record field or combination of 
record fields that will be used to uniquely identify a record 65 
or group of records in the file. The key field can be a 
combination of the Source I.D. with a particular data field or 
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other information selected by the end user. For example, an 
end user might identify each record contained in a file of 
checking account records by constructing a key field that is 
a combination of the Source I.D. "001" and the value 
contained in record data field "Account Number/' Thus, if a 
checking account number and a credit card account number 
happen to be identical, the Source I.D. will differentiate the 
key field information. If a file is received with records that 
lack a logical or potential key field, a virtual key field can 
automatically be constructed, either assigning key field 
numbers sequentially or allowing the end user to choose its 
own key field system. 

Another part of the data object header according to an 
exemplary embodiment of the invention is record layout 
information. If the selected file originated in a relational 
database management system (RDBMS), Merge will 
directly read and interpret the layout information using the 
RDBMS catalog. Merge performs an analogous process on 
personal computer-based databases. In the event that the 
data is not stored with a catalog, however, the end user may 
use the GUI to select an actual record with layout that is 
representative of the other records, or, alternatively, to 
create, a record layout which is representative of the records 
stored in the file and which will be used to identify the record 
fields and field attributes. 

In order to create a representative record layout using the 
GUI, the end user must know how many fields are included 
in each record that the end user wishes to bring into the 
system, the order in which the fields appear, and how many 
bytes are included in each field. The end user enters the 
name of each field into the GUI in the order in which the 
fields appear in each record and also enters the number of 
bytes included in each field. The record layout contains the 
same fields as each of the records stored in the file, and each 
field in the record layout has the same attributes, e.g., 
number of bytes, text vs. numeric, as the corresponding 
record field stored in the file. 

An individual skilled in the art will recognize that the 
Source I.D., key field and record layout examples included 
herein are not exhaustive and that the invention can accom- 
modate a wide variety of data object header information, 
e.g., date of file creation, location of file creation, file owner, 
in order to better identify and/or describe the raw data that 
will be contained in the data object. 

An embodiment of the invention accepts unformatted raw 
data, e.g., images, sound, word processing files, as well as 
formatted raw data. Unformatted data does not require 
layout information. Data object header information may 
include data size, date and location of file creation, and other 
information, which may be appropriately associated with 
unformatted data. 

Once this preparatory work is completed, the job can be 
submitted to Merge for data object creation as noted at 
process option 104 of FIG. 1. Merge executes on a variety 
of platforms, including servers running UNIX or Microsoft 
NT® or personal computers running Microsoft Windows®, 
for example, provided such servers or personal computers 
have approximately 256 megabytes of memory and 100 
megabytes of available disk space. 

Upon execution, Merge scans the data and associated data 
definitions contained in the selected data files as noted at 
scan operation 406 of FIG. 4, maps the data to the previously 
defined header layout information as noted at mapping 
operation 408 (or reads the catalog to ascertain the header 
layout information and map the raw data) and generates data 
profile statistics as noted at statistical operation 410. Merge 
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also compresses the data object header information and raw not found, a binary search is begun for the key field, 

data as at compression operation 412 to complete the searching either midway up the block or midway down the 

compressed data object as noted at formation operation 504 block, depending upon whether the key field requested was 

of FIG. 5. By counting bytes as the data input is brought into higher or lower than the first key field identified in the block, 

the system, Merge can use the pre -entered field length s Once the appropriate key field is found, only those records 

information to insert a break at the end field and to map the having an identical key field are searched to identify a match 

data input to the pre -entered field names in the order that the for the record. 

data input is received. The process by which the invention Other than the expansion just mentioned, data object 

counts bytes and identifies data values and other information header and raw data are kept in and manipulated in a 

as data input is brought into the system is referred to as 10 compressed state. 

scanning the data. Upon comp i et i on 0 f the Merge process, all selected data 

As noted above, Merge converts a wide variety of file Q\ t& have been converted to indexed, compressed data 

types to compressed universal data object form, which objects, each comprised of object data headers and raw data, 

facilitates the use of the invention with a wide variety of data The header contains layout information and other descrip- 

types. In addition to its ability to convert ASCII files, Merge ^ tive m f ormat i ori) including data profile statistics. The raw 

converts EBCDIC files to compressed universal data object data portion of the object includes the actual data and the 

form directly from their mainframe source, without an data index, 

interim conversion to ASCII. Advantageously, the ^universal Thc Mergc pf0cess described abovc ^ be reproduced by 

data objects created directly from these EBCDIC files are ^ u&er for eacfa balch of files received for processing, 

identical to the universal data objects created from ASCII ™ Alternatively, the user can select initial settings and then 

fi Ies - automate the process, so that the invention will automati- 

Furthermore, Merge scans variable length files. In order to ca n y convert files into data object form and load the data 

read variable length records, the end user uses the GUI to store according to a schedule defined by the user, 
specify the record length of the variable length records 

according to a pre-existing layout. The GUI permits the end 25 BUILDING AND SUBMITTING CUSTOMIZED 

user to manually specify the length of each record or to enter REQUESTS 

conditions that will determine how long each individual , 

record will be. Finally, Merge compresses all 256 EBCDIC ^ exemplary embodiment of the invention allows end 

characters, so data quality will not be compromised even on users to * CCeSS ^ ° bj6C * ° f , d< T * b °™ at 

complex data types 30 remote data stores f° r retrieval and manipulation. 

As Merge brings data into the system, it uses system ^ * method according to an exemplary embodiment of 

resources to generate and store data profile statistics. For me invention, a data store makes data object header infor- 

example, Merge keeps a running tally of each data value that mation available so that such information can be viewed by 

it scans and records, among other things, minimum and „ potential end users. An end user using a client application 

maximum data values for each field and a specified number 35 * electronic communication with the data store, views the 

of the most common data values for each field. An individual data object header information that the data store has made 

skilled in the art will recognize that a wide variety of available. The client GUI allows the end user to access those 

sophisticated data profile statistics and derived data values data object header items that are of interest to the end user 

can be automatically generated as Merge scans the data input 4Q ™« to manipulate those data object header items to build a 

brought into the system. customized request for data and data processing. (See FIGS. 

Data profile statistics generated as data is brought into the ^ an( * ^ 

system (e.g., minimum and maximum data values) are Optionally, an end user having data stored in data object 

included in the data object header. Header information format can combine its own header information with the 

remains flexible and can be used to store additional infer- 45 re ^ est and attach its own compressed data to be submitted 

mation in the future. The invention permits an end user to with the request. Once the request is completed and the end 

append new information to the object and to update header user's compressed data, if any, is included with the request, 

information accordingly, without running a subsequent the compressed request is submitted to the data store elec- 

Merge operation on the entire updated file. Ironically for automated execution, as noted at execution 

The scan and compression components of Merge work 50 °P eratlon 302 of FIG - 3 

together to store the data objects in an indexed, compressed The data store execute the data request and any logic 

format as noted at store operation 106 of FIG, 1, Using the specified by the request and returns a customized result set, 

key field identified by the end user, which key field identifies as noted at object operation 304 of FIG. 3, to the end user 

each record in the file, the scan feature builds an index of key in a format designated by the end user, as noted at forward 

field values and organizes it into alphabetical or numeric 55 operation 306 of FIG. 3. 

order (depending upon the key field). The Merge also sorts In a method according to an exemplary embodiment of 

the records themselves into alphabetical or numeric order by the invention, data object hearder information is made 

key field as they are compressed and stores them in com- accessible to remote end users via, for example, the Internet 

pressed data "blocks." or an internal network such as through transfer operation 

Associated with each value in the index is the logical 60 506 of FIG. 5. An object of this method is to permit a data 

address of the data block containing that key field data value. store to create an electronic header display describing the 

When the end user requests a record, the record key field is types of information stored in the data store, and to permit 

identified and then the index is checked for the logical end users to view and access this header display, 

address of the block containing the particular key field value. In order to display data object header information, the 

Upon reaching the appropriate logical block of data, one row 65 data store runs a "Get Header" request against a data object 

of data in the middle of the block is expanded and an attempt to extract the header information from the data object. Get 

to identify a match for the key field is made. If a match is Header identifies the targeted data object by reference to a 
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data object identifier, which is an internal object naming 
convention constructed during the Merge process. The Get 
Header process extracts the header information only; it does 
not extract any raw data. It also does not affect the data 
object, which remains stored in its original form with header 
married to data. The extracted header information forms a 
separate and distinct "extracted header object" containing 
only header information. A file transfer protocol or other 
transfer mechanism is then used to transport the extracted 
header object to a central storage area. 

In a preferred embodiment of the invention, the extracted 
header object is transported to and maintained on a network 
server. Using commercially available display tools, such as 
Microsoft's Visual Basic®, header information is rendered 
available for viewing by potential end users. Thus, remote 
end users in electronic communication with the data store 
network server can view the extracted header object infor- 
mation in order to survey the kinds of data contained in the 
compressed data object store, including any data profile 
statistics created during the Merge process. 

The invention also includes a method whereby proprietary 
logic (i.e., data processing instructions) can be made avail- 
able for remote viewing and selection by end users. Using a 
programming language, such as C, Cobol or SAS®, logic of 
any level of complexity can be written and stored at the data 
store. Once the logic is created, the invention allows the end 
user to create an interface between the logic and the APL, 
which specifies the appropriate input and output parameters 
for the logic. The logic will indicate the data input that it 
requires and how that data input should be provided. Using 
the layout information already stored in the header of the 
data object, the invention allows the end user to map the 
fields required by the logic that the end user wishes to access 
into the input positions required by the logic. Once this 
mapping is completed, the logic can be called by an APL 
request created by an end user and the invention will provide 
the data values corresponding to the mapped fields to the 
logical process as required. Upon conclusion of the process, 
the logic will provide specified data outputs, which will be 
mapped by the end user back into the appropriate fields for 
further processing, as necessary. An electronic representa- 
tion of available logic can also be created for storage and 
display on a network server so that it will be available for 
viewing and access by end users. 

According to another aspect of the invention, an end user 
can access the header and logic display, created according to 
the method described above, for example. In this aspect, the 
client application is hosted on a personal computer, for 
example, which is in electronic communication (possibly 
over the Internet or a private network) with the network 
server storing the header information and logic representa- 
tions. The client application interfaces with an Internet 
browser and acts as a browser application permitting the end 
user to use the GUI to access the data store header infor- 
mation and logic, upon being extracted from the header at 
extraction operation 202 of FIG. 2, that is stored on the 
network server, as noted at access operation 204 of FIG. 2. 
The GUI uses the same APL that was used to create the data 
object store and is, therefore, compatible with the objects 
contained therein. 

According to a preferred embodiment of the invention, 
the end user uses the GUI to select desired header informa- 
tion and logic from the data store display using intuitive, 
point-and-click commands. The GUI APL manipulates 
header information and logic representations created accord- 
ing to the process set forth above. Header information and 
logic representations corresponding to choices made by the 
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end user are transferred from the network server to the client 
application using standard data transfer techniques and 
protocols, such as TCP/IP for the Internet. Thus, the end user 
captures header information identifying specific raw data 

5 contained in a remote data store and information identifying 
logic to be executed on this data. 

As noted above, header information can represent a 
specific record or group of records, a specific data field 
contained in all of the records, a rank ordering of the values 

10 most frequently appearing in a particular field, the minimum 
and maximum values contained in a particular field, or a 
wide variety of other variables created by the Merge. Logic 
can include any process created and stored at the data store. 
As also noted above, the data store has the flexibility to 

15 change or amend the available headers and available logic. 
The GUI is used to manipulate header information and 
logic to create a customized request for data retrieval and 
data processing. In this exemplary method according to the 
invention, the retrieved header information and logic are 

20 displayed on the client GUI, and the end user points- and- 
clicks to manipulate the header information and logic to 
create a customized request. As the end user manipulates the 
header information and logic represented on the GUI screen, 
the client application generates APL to integrate the header 

25 information and logic into an executable request correspond- 
ing to the end user's commands, as noted at request opera- 
tion 206 of FIG. 2. The APL will call the raw data and logic 
corresponding to the accessed header information and logic 
representations after the request is constructed and executed, 

30 as specified by the end user. 

In the preferred embodiment the client application con- 
tains standard instructions, which the end user may select to 
structure a request. Such standard instructions might include 

35 simple data retrieval requests, e.g., a "Get Data" command, 
condition logic, e.g., "if7then" statement, or standard com- 
mands to simplify query construction, e.g., "and/or," 
"greater than or equal to". 
Alternatively, a programmer using APL can write cus- 

40 tomized instructions that can then be integrated into the final 
end user request. 

According to another aspect of the invention, requests can 
be built either online or offline. In one exemplary embodi- 
ment of the invention, the end user request is built while the 

45 client application is in active communication with the data 
store. In another exemplary embodiment of the invention, 
the end user can save the header information and logic 
representations and build a request offline, which request can 
be submitted when the end user reestab fishes contact with 

50 the data store. 

In yet another exemplary embodiment of the invention, a 
data store could store this information on any medium, e.g., 
a pre-packaged array of header information distributed on 
CD ROM, to permit an end user to construct a request offline 

55 and access the data store only to submit the request. 

According to another aspect of the invention, an end user 
building a request according to the method described herein 
has the option to integrate its own header information and 
logic representations into the request and to attach its own 

60 data to the request. In order to construct such a request, an 
end user having an object data store of the type described 
above retrieves header information from its own object data 
store. This header information is stored locally for access by 
the client application. The end user builds a request in the 

65 manner described above, but in this case the end user 
integrates the data store header information and logic with 
its own header information and logic. 
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The GUI is designed to accommodate data store header 
information and logic and end user header information and 
logic on the same GUI screen at the same time. When the 
end user has finished constructing a request combining data 
store and end user header information, the end user attaches 
compressed data corresponding to the end user header 
information included in the request. The resulting universal 
data object, therefore, contains a header, which may include 
end user header information and logic and data store header 
information and logic. The universal data object may also 
include the end user's compressed raw data, which corre- 
sponds to the end user header information. 

According to another aspect of the invention, the end user 
uses the GUI to specify the layout in which the final output 
will be returned to the end user. In a manner similar to the 
manner used by the end user to create the record layout for 
the Merge process, the end user selects (or creates) a record 
layout that contains the fields and field attributes necessary 
to accommodate the end user's expected result set. These 
result layout specifications are included in the header of the 
request sent to the data store, and the result is formatted 
according to the end user's desired result layout. 

PERFORMING HOUSEHOLDING ON END 
USER DATA 

Householding operations can be performed using end user 
data. Although one embodiment of the invention performs 
householding on compressed universal data objects, 
householding, as described herein, can also be performed as 
a discrete function on raw data that is not stored in universal 
data object form and not transmitted over a network. 

Householding is a process whereby relationships among 
different records are identified. For example, householding 
might be used by a bank to identify a relationship between 
a checking account held at the bank and credit card account 
information held at a remote credit bureau, even though the 
information pertaining to these accounts is stored on differ- 
ent systems and the only information in common between 
the two record types is the customer's name and address 
(household). 

To perform this operation, the bank would first parse the 
data in order to understand the input data and to eliminate 
incorrect records and/or records that the end user does not 
wish to include in the request. To parse the input data, the 
end user would attempt to match a checking account record 
to a householding reference list (the "Reference list"), such 
as complete U.S. postal records, in order to verify that the 
record information is current and correctly entered. 

Once the information is verified as accurate, the system 
then attempts to match the checking account record name 
and address to a credit bureau record name and address. 
Once a match is found, the system can alert the end user to 
the relationship between the two accounts or insert a "hard 
link" to ensure that the relationship between the accounts is 
always noted. This information might be used to eliminate 
redundant promotional mailings or to allow a company to 
cross-sell products from different parts of the business to 
existing customers. 

The invention includes a householding process that allows 
the end user broad flexibility to specify the data patterns that 
the end user will accept as a match for its particular parsing 
and householding purposes, without sacrificing perfor- 
mance. For example, end users can specifically tailor match 
parameters for jurisdictions having different naming or 
addressing conventions, or craft match parameters tailored 
to their specific raw data or intended use. It also allows end 
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users to tailor how they will compensate for data entry errors 
contained within end user records. 

In a method according to the invention, the end user enters 
data input from a record layout information for the file that 

5 the end user wishes to household into a GUI, in the manner 
previously described (if such information is not already 
stored in universal data object form). The invention maps the 
layout information to the file raw data as that data input is 
brought into the invention, in the manner previously 

10 described. Alternatively, if the selected file originated in a 
RDBMS, the invention could directly read and interpret the 
layout information using the RDBMS catalog. An individual 
skilled in the art will recognize that there are a number of 
different ways to enter and store layout information and raw 
data in a manner that permits mapping of layout information 

15 to the data input. As each file is brought into the invention, 
each record in the file is assigned a sequential number for 
identification purposes. 

In a preferred embodiment of the invention, the invention 
will standardize characters for parsing and matching as data 
input is loaded into the invention, according to preferences 
indicated by the end user. In the preferred embodiment, the 
invention displays all 256 potential characters on a GUI 
screen. An end user can point and click to specify which, if 
any, characters are to be changed as the data input is loaded 

25 into the invention in order to ensure that data is in the format 
desired by the end user for parsing against the chosen 
Reference list(s). For example, an end user could indicate 
that all data should be changed from lowercase to uppercase 
characters. An end user might also compensate for a known 
data entry error by entering instructions to change a par- 
ticular symbol, letter or number appearing in a specific field 
to another symbol, letter or number that corresponds to data 
conventions in a reference list. 

35 Is In a preferred embodiment of the invention, the end 
user uses the GUI to parse the data by specifying those 
relationships between records that the end user wishes to 
show as a "match." End users may specify, field-by-field, the 
kind of match that will be deemed acceptable. For example, 

40 an end user might build a householding request that identi- 
fies the relevant fields as the "name" "street," "city," and 
"zip code" record fields, while excluding the "account 
number" field as irrelevant. The end user would then con- 
struct several different potential record structures to capture 

45 the data in an attempt to determine which items of raw data 
fall into fields that match the end user's reference list. 

An end user might create one scenario in which the zip 
code will appear in the fourth field, one scenario where it 
will appear in the fifth field and one scenario in which it will 

50 appear in the sixth field. The end user could then build in 
fault tolerance for particular fields. 

For example, the end user might specify that the request 
will tolerate a letter transposition or missing letter as a match 
in the "city" field or in the "street" field so long as the "zip 

55 code" field is an exact match. Alternatively, the end user 
might specify that it will tolerate a typographical error in the 
"zip code" field so long as the "city" and "street" fields are 
exact matches. 
In another variation, the end user might tolerate a trans- 

60 position of the first and last names in the "name" field so 
long as "address," "city" and "zip code" are exact matches. 
Each of these alternative "match" scenarios are constructed 
by the end user with the GUI, and each record will be tested 
through each scenario in turn in order to identify whether or 

65 not a specific address will be accepted for householding. 
An application of the preferred embodiment of the inven- 
tion to a file containing many customer records will now be 
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described in more detail by way of example, with particular 
reference to customer "Name" information stored in each 
record. (See FIGS. 6-10) 

After the relevant data file is loaded into the invention and 
mapped to a layout, as noted at layout operation 702 Of FIG. 
7, according to the process described above, the end user 
uses the GUI to select the "Name" field as the first field for 
parsing. Data input for the Name field will contain several 
different kinds of customer names, which may include a 
customer's name prefix (e.g., "Mr.", "Mrs.", "Dr."), first 
name, middle name, last name, and name suffix (e.g., "Jr.", 
"3 rrf ", "Ph.D.") and/or other variations. Once the end user 
selects Name as the relevant field, the invention will build 
"Input patterns" for the field by checking each byte in the 
Name field for each record in order to determine whether 
each byte is a letter or a number, as noted at pattern operation 
602 of FIG. 6. If the byte is a number, it is assigned "N" to 
identify it as a number. If the byte is a letter, it is assigned 
an "L" to identify it as a letter. If more than one byte 
containing a letter is strung together, without an intervening 
space, the byte string is assigned a "W" to identify it as a 
word, Symbols, such as "#", "?", or "— " are generally 
represented in their original form, subject to the user's 
standardization preferences, as previously described. The 
invention is designed with particular conventions interpret- 
ing unorthodox strings, (e.g., a letter and a number appear- 
ing consecutively, without an intervening space, could be 
identified as "LN", while other conventions could be for 
other unorthodox strings). 

Once the invention completes its analysis of the Name 
field for all records, the invention will display the various 
input patterns included in that field for each record, as noted 
at display operation 604 of FIG. 6. For example, in a 
100,000 record file, the invention might indicate that the 
input pattern "WWLW-W" W (Word, word, Letter, Word- 
Word) occurs 5,000 times in the Name field. An example of 
a name input corresponding to this input patter is "Mrs. Ann 
A. Smith- Jones," Another example of a name input corre- 
sponding to this input patter is "The Honorable A Smith- 
Jones". Many other input patterns, such as "WLW" (e.g., 
"Ann A. Smith") and "WLWWN" (e.g., Mr. A. Albert Smith 
3 rd ) would be generated by a typical analysis of a large 
number of Name fields. 

In a method according to the invention, the end user uses 
the GUI to identify the householding reference list against 
which the end user ultimately wishes to parse the data input. 
A separate analysitcal file is automatically created with data 
about the referense list. For example, the invention examines 
the reference list to determine the number of bytes contained 
in each field of each record and stores this field length 
informaiton with the reference list, as noted at list operation 
710 of FIG. 7. In the event that the end user wishes to parse 
itsi data against more than one referense list, multiple 
referense lists can be designated by the end user. For 
example, an end user would typically use separate commer- 
cially available reference list(s) to parse Name.Prefises, 
Name Suffixes, and First Names before parsing the entire 
data input for a given record against a comprehensive 
reference list. 

The end user then uses the GUI to construct an output 
table for the input data, which output table consits of entries 
(output table entries) corresponding to data contained in the 
end user's chosen reference list(s), as noted at table ogpton 
606 of FIG. 6. For example, an output table for the Name 
field would likely consist of at least the following output 
table entries: "Name Prefix," "First Name," "Middle Name." 
"Last Name," and "Name Suffix," which fields would also 
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exist in the reference list(s). In order to save memory space 
later tin the process, each output table entry is represented by 
a single letter or symbol (output table symbols). For the 
foregoing list of output tables entries, representative output 

s table symbols could be "P" for "Name Prefix," followed by 
"F," "M," "L," and "S" represented each of the correspond- 
ing output table entries. 

Once the output table is constructed, as noted at table 
operation 704 of FIG. 7, the end user uses the GUI to map 

10 the output table symbols to the reference list, as noted at man 
operation 610 of FIG. 6. To accomplish this, the end user 
uses a computer mouse to physically drag and drop graphi- 
cal representations of the output table symbols onto graphi- 
cal representations of the reference list(s) fields, and the 

15 invention builds corresponding programming script to link 
these output table symbols to the reference list(s) fields 
according to the relationships indicated by the end user. 
Thus, for example, the end user might map output table 
symbol "P" for "Name Prefix" to the Name Prefix field of 

20 the reference list. The invention allows the end user to map 
a single output table symbol to multiple reference list(s) in 
the event that the end user wishes to check the data input 
against data stored in more than one reference list. At this 
stage of the process, the end user has (i) a range of input 

25 patterns generated from the data input that the end user 
wishes to parse and (ii) an output table containing output 
table symbols that are mapped to the end user's reference 
list(s). 

In a preferred embodiment of the invention, an end user 

30 having input patterns and an output table evaluates the input 
patterns and maps each position in each input pattern to one 
or more output table symbols in the same manner as the 
output table symbols were mapped to the re feme de 1 is t(s), as 
described above, as noted at map operation 608 of FIG. 6. 

35 The invention allows the end user to view data input 
corresponding to a given input pattern in the event that the 
end user wishes to generate ideas for possible mapping to the 
output table. In the event that the end user does not wish to 
map some input patterns to the output table (e.g., because 

40 they represent so little of the data input that the end user does 
not wish to parse them), the end user may simply leave those 
input patterns unmapped. 

The invention uses the end user's mapping of the input 
patterns to the output table to generate search patterns for the 

45 input patterns, which search patterns represent all of the 
possible permutations of the input table/output table map- 
ping completed by the end user as noted at pattern operation 
612 of FIG. 6. This process generates several different 
search patterns through which data input will be parsed. 

50 Thus, the search patterns represent the end user's mapping 
of input patterns to output table symbols, which output table 
symbols are in turn mapped to the reference list(s). Data 
input corresponding to each given input pattern will be 
tested through all of the search patterns associated with that 

55 input pattern until the invention locates a match to the 
reference list(s) or until the search patterns are exhausted, as 
described in more detail below and as noted at parse 
operation 614. 

In another embodiment of the invention, the end user 

60 specifies the fault tolerance acceptable for each position in 
the search pattern as noted at tolerance operation 708 of FIG. 
7. For example, given the search pattern PFML-L, the end 
user might require an exact match for the "Last Name" 
positions but tolerate some discrepancies in other positions. 

65 Alternatively, the end user might tolerate a discrepancy in 
the "Last Name" positions for the same search pattern, if all 
of the other positions in that search pattern are exact 
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matches. Thus, a number of different fault tolerance sce- 
narious can be created for each search pattern. 

The preferred embodiment of the invention includes pre- 
built fault tolerance logic that end users can easily incorpo- 
rate into their search patterns. For example, the invention s 
will allow an end user to automatically classify data as a 
match for parsing purposes if the first and/or last letters 
match; to allow for common typographical errors or letter 
transpositions; and to allow for mistakes made because 
words or letters sound alike. The end user can also write its 10 
own fault tolerance logic, which will be stored by the 
invention and incorporated into the process at run time when 
input for the relevant search pattern is run. 

Upon completion of this process, the end user has mapped 
Input patterns to the output table, which output table is in ]5 
turn mapped to the reference list(s). The mapping of the 
input patterns to the output table creates different search 
patterns. Data input for each Input pattern is tested through 
the search patterns in an effort to find a match to the 
reference list(s). Specified fault tolerance indicates for each 2 o 
search pattern position (i.e., for each output table symbol 
within the search pattern) the divergence from the reference 
list that the end user will tolerate as a match in each position. 

The end user may evaluate search patterns generated by 
the invention, to eliminate any search patterns that the end 25 
user believes will not match data input to the reference 
list(s). The preferred embodiment of the invention employs 
logical rules to speed processing by eliminating unlikely or 
unhelpful search patterns before parsing against the refer- 
ence list(s) begins. For example, the invention will eliminate 30 
any search pattern that does not require an exact match on 
at least one field, so that the invention ensures a minimum 
level of certainty for any match. The invention will also 
eliminate certain illogical search patterns, such as search 
patterns that interrupt strings of identical output table entries 35 
(e.g., a search pattern "Name Prefix," "First Name," "Name 
Prefix" would be eliminated even if this pattern could be 
generated as a result of end user mapping to the output table, 
because the "Name Prefix" position would not logically be 
interrupted by the "First Name" position). The invention 4Q 
also analyzes fault tolerances for each search pattern and 
eliminates those fault tolerances that would necessarily be 
satisfied if another fault tolerance for that same search 
pattern were satisfied, so as to eliminate unnecessary search- 
ing. An individual skilled in the art will recognize that these 45 
logical rules and other similar rules could be added or 
subtracted from the invention, depending upon market pref- 
erences as well as the desired combination of accuracy and 
efficiency, 

Householding speed and efficiency are maximized by 50 
maximizing use of system memory to reduce the time 
necessary for I/O exchanges with an external device. The 
end user's GUI-generated householding specifications are 
used to identify those fields within the reference list(s) that 
are relevant for parsing, as noted at field operation 706 of 55 
FIG. 7, and the invention eliminates from the process all 
reference list(s) fields to which the end user has not mapped 
any output table symbols. Separate parsing table(s) (each a 
"parsing table") are created from the reference list(s), which 
parsing tables consist only of (i) those fields to which the end 60 
user has mapped output table symbols and (ii) analytical 
information relating each of those fields, as noted at table 
operation 712 of FIG. 7. As noted earlier, an end user may 
have more than one reference list, so the invention will 
create more than one corresponding parsing table, as nec- 65 
essary. In many cases, there are several small parsing tables 
used to parse particular kinds of data input (e.g., "Name 
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Prefix" or "First Name") to which only search pattern 
position is mapped and a larger, more comprehensive pars- 
ing table to which several positions in the same search 
pattern are mapped. 

The invention uses the end user's search pattern informa- 
tion to build indices on these parsing table(s) to facilitate the 
location of parsing table data during the parsing process. In 
the event that a parsing table has no more than one position 
in each search pattern mapped to it, the invention will access 
the parsing table and the accompanying field-length infor- 
mation and build an index that stores the length of each 
value in the parsing table. In the event that a parsing table 
has more than one position in a search pattern mapped to it, 
the invention will access the parsing table and the accom- 
panying field-length information and build separate indices 
for each search pattern position mapped to the parsing table. 
Each of these indices will store the length of each value in 
the parsing table, the actual data values themselves and the 
number of times that each data value appears in the parsing 
table. 

In a preferred embodiment of the invention, the parsing 
table(s) and the parsing table indices are decompressed (if 
previously stored in compressed form) and loaded into 
system memory for fast, efficient reference during the pars- 
ing process. In another exemplary method according to an 
embodiment of the invention, the contents of each parsing 
table index and the contents of each parsing table are sorted 
into blocks as they are moved into system memory, as noted 
at sort operation 714 of FIG. 7, according to the number of 
bytes contained in the index entries and parsing table fields 
(e.g., all index entries and all parsing table fields consisting 
of only one byte will be sorted into the first block of index 
entries or fields, as the case may be — all index entries and 
all parsing table fields consisting of two bytes will be sorted 
into the second block of index entries or fields, etc.). Within 
these blocks sorted by byte length, parsing tables fields and 
parsing table indices containing actual data values are fur- 
ther sorted into alpha/numeric order by data value. The 
invention generates code that "points" from each index entry 
to the block of parsing table fields containing the same 
number of bytes or, when the actual data value is included 
in the parsing table index, to the block of parsing table fields 
having an identical data value, as described in more detail 
below. In the event that any parsing table contains two or 
more fields, each of which must be parsed, the parsing table 
and its index will be sorted into boyte length and data value 
order by a primary key field, described below. 

Having constructed search patterns from the end user's 
specifications and having reorganized parsing tables and 
parsing table indices and moved them into memory, the 
invention builds a series of execution tables, as noted at 
build operation 716 of FIG. 7, which are designed to 
optimize parsing by eliminating at an early stage in the 
process those search patterns that will not generate a match 
between the end user's data input and the parsing table(s). In 
a method according to the invention, the first execution table 
sorts the search patterns in order to optimize the efficiency 
of the parsing process after the search patterns are generated 
but before the invention accesses any of the data input for 
processing. 

For an individual input pattern, there will typically be 
many search patterns, as described above. For each indi- 
vidual input pattern, the first execution table examines each 
position within each of the search patterns to determine 
which search patterns should be parsed first in order to 
maximize efficiency. Using the end user's specifications, the 
first execution table analyzes all of the search patterns 
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generated by a given input pattern and identifies the output group of search patterns, saving processing time and system 

table symbol for which an exact match is required most resources. In the event that a match were found, the inven- 

frequently in the same search pattern position as noted at tion would continue with the same search pattern and 

analyze operation 802 of FIG. 8. Once this output table determine which data value to retrieve next from the data 

symbol and position are identified, the search patterns are s input according to the ranking of search patterns with this 

sorted so that all search patterns having the designated first group. 

output table symbol in the designated search pattern position If exact matches were found for each of the search pattern 

will be analyzed first. positions requiring an exact match to the parsing table(s), the 

For example, if the first execution table determined for a invention would next retrieve data input for those positions 

particular Input pattern that an exact match to output table a0 in the same search pattern for which some fault tolerance 

symbol "F' (for "First Name") was required in the second was specified by the end user. This process would continue 

search pattern position more often than an exact match was for the same search pattern until a match was found between 

required for any other output table symbol in any other the data input and the parsing table(s) in each position or 

search pattern position, the first execution table would use until an attempted match failed and the search pattern was 

this information to prioritize all search patterns requiring an 35 eliminated from the process for the purposes of analyzing 

exact match for "F" in the second position. If there were that particular data input. If all search patterns were elimi- 

several search patterns requiring an exact match "F" in the nated without a match, the data input would fall into a 

second position, the first execution table would prioritize catchall default "bucket" established by the end user for data 

search patterns within this grouping further. Having priori- input that does not match a parsing table. The invention 

tized all search patterns requiring an exact match for "F" in 20 would then begin the process with new data input, 

the second search pattern position, the invention would then In order to use the search patterns to compare data values 

prioritize search patterns within this group according to the to parsing table(s), the invention will retrieve one data value 

output table symbol that must be an exact match in the same at a time from the data input, as set forth above. In the event 

position in the greatest number of these search patterns. This that a search pattern position is mapped to both a subsidiary 

process continues until the first execution table has priori- 25 parsing table, to which only one position in a search pattern 

tized all search patterns in this group according to the output is mapped, and to a more comprehensive parsing table, to 

table symbols in the same position that require an exact which more than one search pattern position is mapped, the 

match as not prioritize operation 804 of FIG, 8. invention will compare the data value to the subsidiary 

Once the first execution table has completed the search parsing table before proceeding to the comprehensive pars- 
pattern sort described above, it will continue its analysis of 30 in 6 table. 

those remaining search patterns that have not yet been In order to parse against a parsing table to which only one 

sorted. The first execution table identifies the output table search pattern position is mapped, the invention follows the 

symbol for which an exact match is required most frequently following procedures for those search pattern positions 

in the same search pattern position, excluding those search requiring an exact match to the parsing table(s). First, the 

patterns that were included in the first group. Once this 35 invention retrieves the data value from the data input and 

output table symbol and position are identified, all search calculates the number of bytes contained in the data value, 

patterns having the designated output table symbol in the The length of the retrieved data value is compared to the data 

designated search pattern position are sorted into a second value length information stored in the applicable parsing 

group of search patterns. Within this second group, search table index. As noted earlier, the entries in the index are 

patterns are organized in the manner described above for 40 sorted into blocks according to the number of bytes in each 

organizing search patterns within the first group. The first entry. If the invention does not locate an entry in the parsing 

execution table continues this process until it has sorted all table index containing the same number of bytes as are 

of the search patterns and prioritized them by identical contained in the data value, the invention determines that 

output table symbols in identical positions requiring an exact there cannot be a match to the parsing table, and all search 

malc h t 45 patterns requiring an exact match in this position can be 

Upon execution of a request, the second execution table immediately eliminated for the purposes of analyzing this 

will use the search patterns to attempt to match the data data input. Thus, an entire group of search patterns would be 

input, as retrieved at access operation 806 of FIG. 8, to the eliminated at this early stage in the process, before extensive 

parsing table(s), as noted at match operation 808, so that a amounts of time and resources were consumed in an attempt 

match will identify the reference list field to which the data 50 to match the entire data input to a parsing table, 

input corresponds. The second execution table will begin its If a block of index entries having the same byte length as 

processing with the first group of search patterns and move the data value is found in the index, the invention will use 

sequentially through each search pattern until the search the code that "points" to the parsing table to proceed directly 

pattern is eliminated or a complete match to the parsing to the block of records in the parsing table with data values 

table(s) is found as noted at query operation 812 of FIG. 8. 55 having the same number of bytes. As noted earlier, the 

In the example given above, where the first group of search parsing table is sorted into blocks depending upon the 

patterns required an exact match for "First Name" in the number of bytes contained in the data value and then input 

second position, the second execution table would first alpha/numeric order by data value within these blocks. The 

access data stored in the second position of the first data invention proceeds to the middle of the block of parsing 

input and determine whether the data for this position 60 table values containing an appropriate number of bytes and 

matches the "First Name" position in the parsing table(s) to then performs a binary search, moving either higher or lower 

which "First Name" is mapped, as described in more detail within the block depending upon whether the data value for 

below. In the event that this data valued did not match the which the invention is searching occurs alphabetically or 

relevant parsing table(s), all search patterns requiring an numerically before or after the previous value accessed. In 

exact match "F" in the second position could be immediately 65 the event that a match for this data value is not found within 

eliminated from the process for this data input at eliminate the parsing table, it can be immediately determined that all 

operation 810 and the invention would move to the second search patterns requiring an exact match in this position can 
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be immediately eliminated for this data input. Thus, an entire 
group of search patterns would be eliminated at this stage io 
the process, before extensive amounts of time and resources 
were consumed in all attempt to match the entire data input 
to a parsing table. In the event that a match for this data value 
were found within the parsing table, the invention would 
build a virtual table in memory, at noted at store operation 
814ofFIG.8, to store the match information while it returns 
to retrieve the next data value from the data input and 
continues with the search pattern. 

In some cases, an end user, knowing of potential faulty 
data input, might include for parsing purposes a parsing 
table that contains incorrect data values. For example, a 
parsing table might contain words with intentional misspell- 
ings so that if the data input contains a misspelled data value, 
the invention will still locate a match in the parsing table and 
will not discard a correct search pattern simply because a 
data value is misspelled. In this case, the end user might 
specify that a different, "cleansed" data value with a cor- 
rected spelling should be stored by the invention in memory 
for inclusion in the data output at the end of the process. The 
invention stores each data value that matches the parsing 
table in a virtual table in memory until it completes its 
parsing for the entire search pattern and discovers whether 
or not the data input will be a complete match. In the event 
that a complete match is achieved, the invention will note 
the position of each data value within the search pattern and 
store the entire data input for eventual output. In the event 
that the data input fails to match through a given search 
pattern at any point in the process, the data stored in the 
virtual table for that search pattern will be discarded. The 
process described above is repeated for each data value for 
which the end user has required an exact match to the 
parsing table. 

In the event that an individual search pattern has more 
than one position mapped to a parsing table, which parsing 
table contains related data values corresponding to more 
than one output table symbol, matching to the parsing table 
will proceed according to a different process from that 
described above. Specifically, when two or more different 
search pattern positions are mapped to related data values in 
the same parsing table, the invention will not examine the 
data values for each search pattern position in isolation. 
Rather, the invention will require that two or more data 
values received from the same data input match two or more 
related data values in the parsing table. For example, if one 
parsing table contains related "First Name" fields inserted in 
the same records and "Last Name" fields and both the "First 
Name" ("F") and "Last Name" ("L") search pattern posi- 
tions are mapped to this same parsing table, the invention 
will not attempt to match "First Name" and then attempt to 
match "Last Name" in isolation; it will require that related 
"First Name" and "Last Name" data inputs match both the 
"First Name" and the "Last Name" entries in the same 
parsing table record. 

In order to parse data values from the same data input that 
are mapped to the same parsing table, the invention retrieves 
the first of these data values, prioritized as set forth above, 
which data value is the primary key. Upon retrieval of this 
data value by the invention, the invention calculates the byte 
length of the data value and proceeds to the parsing table 
index in the same manner as set forth above. As noted 
previously, however, the index for the primary key contains 
the byte length of the various data values and the actual data 
values themselves, sorted by alpha/numeric order and the 
number of times each of those data values is repeated in the 
parsing table. After locating the block of index entries 
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containing the same number of bytes as the data value, the 
invention proceeds to the middle of this block and performs 
a binary search, moving either higher or lower within the 
block depending upon whether the data value for which the 

5 invention is searching occurs alphabetically or numerically 
before or after the primary key value. 

If a match to the data value is found in the parsing table 
index, the invention accesses the information regarding the 
number of times that the data value is repeated in the parsing 

10 table and stores this information in a virtual table in memory. 
The invention then uses the index code that "points" to the 
corresponding block of records in the parsing table having 
the same data values in this position and proceeds directly to 
that block (unless the data value is unique to one parsing 

15 table entry, in which case the index would point to the 
specific entry). If a match were not found at any stage of the 
second execution table's attempt to match the data input to 
the index or to the parsing table, all search patterns requiring 
an exact match for this data value in this position for this 

2Q data input would be immediately discarded, as discussed 
above. 

At this stage, the invention would retrieve the second of 
the data values mapped to this parsing table, prioritized as 
set forth above. This data value would be the secondary key 

25 for the search against the parsing table. Upon retrieval of this 
data value by the second execution table, the invention 
calculates the byte length of the data value and proceeds to 
the separate parsing table index built for the secondary key. 
The index is searched in the same manner as set forth above. 

30 As noted previously, however, the index for the secondary 
key also contains the actual data values themselves, sorted 
by alpha/numeric order. After locating the block of index 
entries containing the same number of bytes as the data 
value, the invention proceeds to the middle of this block and 

35 performs a binary search, moving either higher or lower 
within the block depending upon whether the data value for 
which the invention is searching occurs alphabetically or 
numerically before or after the secondary key value. 
Because the parsing table is organized into data blocks 

40 according to the primary key, the index cannot point to a 
block of records with data values identical to the secondary 
key. Instead, if a match to the data value for the secondary 
key is found in the parsing table index, the invention uses 
index code that "points" to the first data value in the parsing 

45 table having the same data value in the same position as the 
secondary key. The parsing table entry containing this 
identical value also contains code that "points" to the next 
location in the parsing table that contains an identical data 
value in this position, and that location points to the next 

50 location containing an identical data value in this position, 
through to the end of the parsing table. Thus, the first 
primary key entry in the parsing table is linked to all other 
entries having an identical value in the same position 
because the parsing table has been arranged to group all of 

55 these values consecutively in a block. The first parsing table 
entry containing a data value that is identical to the second- 
ary key value is linked to all other entries having an identical 
value in the same position because each such position within 
the parsing table contains code that identifies the next 

60 position having an identical value. 

At this stage, the invention accesses the information 
regarding the number of times that the specific data value for 
the primary key and for the secondary key is repeated in the 
parsing table and compares this information for the primary 

65 key and the secondary key, as at compare operation 902 of 
FIG. 9. In the event that the primary key value appears fewer 
times in the parsing table than the secondary key value, the 
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invention will proceed to the first record in the block be eliminated. When the parsing process is completed, the 

containing the primary key value and begin to search each end user will have a complete and standardized set of 

record in the block to determine whether it can also match information within the end-user created specifications 

the secondary key value to any record in that block. In the (match parameters) and fault tolerance (permitted deviation 

event that the secondary key value appears fewer times in 5 f r0 m the reference list). As noted above, this information is 

the parsing table than the primary key value, the invention stored each ^ mc a match for an cnt j re data input is found 

will proceed to the first record containing the secondary key using one of lhe patterns generated by the invention, 

value and begin the search each record containing the The invention stores, for output purposes: (i) the data input, 

secondary key value to determine whether it can match the the paUem positions into which the data input 

primary key value to any record to which this first record is 1Q Ms; and substitute or adjusted data added during 

linked. By beginning with the key value that matches the {hQ sg lQ CQrrect fault data input 

fewest records, as at compare operation 904, the invention , , ... 

... ' . ^ t ao „L a j Tt,^ Those data inputs that did not match the parsing tables are 

minimizes the number of records that must be searched. The . ;.„«. , 1 4U a r.u tu 

. t , t . „ „, , ft „ mr j ,u fl collected in a catchall bucket at the end of the process. The 

invention creates a virtual table in memory to record the . . r 

■#u* *u (n u fl f«, _l „' j ™* end user can view or print this data to examine why it did not 

posit on within the parsing table for each record contaimng < . . . . U i / \ a i a- * * * *u 

* t. r i_ .u • a a 1 15 match the parsing tablets) and make adjustments to the 

matches for both the primary and secondary keys. ^ & v/ . l r n * «l 

, ,. . , , . process if necessary to ensure a match tor some or all or the 

If a match for either the primary or secondary key fails at a*** 

, , , , . . reiDdiniii^ udLd. 

any stage in this process, all search patterns requiring an ° ^ . 

exact match for these data values in these positions for this ™* P"^ data W c [ eate u d b y the ulve J nUon can lhen 

data input would be immediately discarded, as discussed 20 be matcbed or householded with other records in the same 

a k QVe file or in other files to demonstrate previously unrecognized 

T , ' , , * j „ , c ac a i relationships among records or with records from another 

In the event that a match were found for the secondary key * f , , . . , . . 

e Jf , . , . / , A source. For example, because the data input has been run 

in one of the records for which a match was already found *. u 

c . , . . ... , . c a through the search patterns and matched to the parsing 

for the primary key, or in the event that a match were found 6 . , , . „./ * * 

f ^ . 1 f4 v a f „u n ™„*m, tables, it is now readily apparent that data inputs Mrs. Ann 

for he primary key in one of the records for which a match 2 s A c i » «t< u li a c Kx, i a 

/ , I * , 4U , , ... „ ZD A. Smith-Jones/ "The Honorable A. Smith-Jones, and 

was already found for the secondary key, the invention ' T „ . , . , . . 

, 7 , , t . t , ' ./ . t , . r- "Mary Ann A. Smith- Jones contain data values that belong 

would return to the data input and access the data value for . ^ . . t i, rtll „L ,u!! 

. ^ * rt in different positions in the reference list, even though the 

the next search pattern position that .. mapped to the same J . dentified aU ^ f t b attem 

parsing table, if any, prtonUzed as set forth ^aboye ^e re ,s , iWWLW . w „ ^ ^ , he ^ Uerns l0 „ tnese 

no tndex in memory for this thud search pattern position 30 ? J P were ^ from 

however. Instead, the invention proceeds directly to the ^ f . , .\ o „rK„^ 

. ! . f . f. , -.11 the reference list) the invention has discovered where each 

virtual table storing information about the location of all ' u , , . 

j r * ■ i , i c a c *u , *~a data value fits within each record as compared to the 

records for which matches were round tor the primary and , , , . . 

j i tu • -ii k * t« 5 th« reference list. The examples given above would have been 

secondary keys The invention will begin to compare he ^ ^ ^ search & 

data value for the third search pattern position to he 35 "ppml.l « "PPFL-L " and "FFML-L" respectively, so that 

appropriate position within each record for which a match ! , , ^u m /„ t ih*. 

r> ac tt. * j rt j_ u„, rc . a the invention can now store and output each element ot the 

was found for the primary and secondary keys. A virtual . . • . -A. 

, i . t . i r ii ^ - t u- *u data input m its proper position. The invention may also 

table stores the location of all records within the parsing ^ F , f r # - n , io ,^^ o „ ^ 

* Li r l- l * t, • r a f~ *l» r f™J ^ have cleansed the data as set forth above, depending upon 

table for which a match is found fo the data stored m he Actions provided by the end user. Became the 

third search pattern positron. By attempung to match toe n ^ M ^ „ ^ (<The HonorabIe „ as 

data stored in the third search pattern position only to the , . , MX , A „ . „. 

, c . 4 , c . ■ . . j Name Prefixes and determined that "Mary Ann is a First 

subset of parsing table records for which the primary and „ fiv ™« 

*1 & . , , . t . . i( _ Name rather than a Name Prefix, the invention can now 

secondary keys matched, the invention speeds the process , . ' 1i( ffl , ro ., 

, j • »u i V • ♦ li j * compare and match these data inputs to publicly available 

by reducing the number of parsing table records to which the r * /x f - L ^ j f 

7 f « reference hst(s) or other data in a straightforward manner for 

data must be compared. 45 , . u* 

m . .„ . . . . t * householdmg purposes. 

The invention will continue to access the data input tor ^ • ■ -i . 4 t. 

, . , -.. • ■ avn „t ™ 0 #„l Matching is performed in a manner similar to the way m 

subsequent search pattern positions requiring an exact match »««.v*m B y j 

to the same parsing table and to search for matches in the ^ in P ul data was ^ atcbed t0 table > exce P H l 

parsing table(s) as described above for the third search ^ lf P arsin ? has alrcad y b ecn completed, there is no need 

t . t1 « u ™^,.ivi«««»« to generate input patterns and to use search patterns to 

pattern position until all search pattern positions requiring an so , & . t ^ j . • * n ^ 

, * l 4 *l • ^^fi^ determine the proper positions ot data inputs. Kather, the 

exact match to the same parsing table are satisfied or until * * • u.r j ■ /tr * 

. . „ ^ Q A,__ t ^% invention can perform a straightforward comparison of First 

the search patterns are exhausted. VT a \ XT r .vt * t Ar 

. f i tJ r iif.u l Names to First Names, Last Names to Last Names, etc. 

When this process has bee completed for all of the search a , , . „ , , 

pattern positions requiring an exact match to the parsing . As noted above, each record is sequentially numbered as 

table(s), the third execution table will access the data input 55 * * b ™? bt "*> ^ sequential numbers 

for the first search pattern position in the first remaining remain ^ thesc filcs throughout the parsing and matching 

search pattern for which the end user has required a match process. 

but indicated that some deviation from the exact data Individuals skilled in the art will recognize that many 

contained in the parsing table(s) is permissible, as at access different parsing scenarios could be constructed using the 

operation 1002 of FIG. 10. The invention then retrieves the 60 end-user-driven, execution table process described above, 

data value for this search pattern position as at retrieve The above explanation is intended to describe a process for 

operation 1004 and attempts to match it to a parsing table as allowing end users broad flexibility to tailor their house- 

at match operation 1006, applying fault tolerance logic as holding request without sacrificing the speed and efficiency 

specified by the end user as at tolerance operation 1008. In of the householding process. 

the event that a match is not found, all search patterns 65 In order to accomplish these goals, the end-user-defined 

applying the same fault tolerance to the same data value in householding specifications are used to create (i) an indexed 

the same search pattern position for the same data input can parsing table that can be held in memory, and (ii) levels of 
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execution tables that can be used to rapidly eliminate records 
that do not match end user specifications for incoming data 
or patterns found in the parsing table. 

This parsing and matching can be used in combination 
with the other process described above to execute a house- 5 
holding request on data stored in a remote data store. For 
example, according to the process of making header infor- 
mation and logic representations available to end users, as 
described above, the data store makes a representation of 
householding logic available to remote end users. The data 1° 
store also makes available the layout of parsed data that it 
must receive in order to execute the householding logic that 
it has made available. An end user, using the client GUI 
application described above, accesses the householding rep- 
resentation remotely, captures that representation, and inte- 15 
grates it into a request, as described above. The end user then 
adds its own header information, as described above, to 
integrate its own data into the householding process. 

Before attaching its selected data input to the request, the 
end user parses the data input according to the end -user- 20 
driven, execution table method described above. The end 
user uses the GUI to establish parsing specifications that 
correspond to the specifications made electronically avail- 
able by the data store so that the data store householding 
process will have the information that it needs to effectively 25 
execute the householding request. 

Once the parsing is completed, the end user appends its 
parsed data input to the request. The request is then sub- 
mitted to the data store. 3Q 

Standard Internet TCP/IP addressing conventions route 
the request to the proper server at the data store, which 
server is in communication with the data object store. The 
APL instructions ensure that the data store automatically 
households the parsed end user data input with the specified 3S 
data store data input. 

The result set generated by this householding process 
forms a new compressed data object, consisting of 
combined, householded data input from the end user and 
data store, and header information describing the combined 40 
result set. Optionally, the end user may specify additional 
data store logic to be executed on the combined data before 
a result is returned. 

MULTIPLE DATA STORES 

45 

The GUI can access and manipulate header and logic 
information from any data store and/or vendor that has used 
the APL to store its data in the object form described above. 
An end user, therefore, can access header and logic infor- 
mation from multiple data stores and use the GUI to build a 50 
single request containing header and logic information from 
these various data stores. 

In an exemplary method according to the invention, each 
data store receives a unique identification key. When the 
Merge process is executed, the identification key is embed- ss 
ded in the header information of each universal data object 
as unformatted data. Therefore, when a data store executes 
a Get Header request to extract header information, the data 
store identification key can be transferred with the header 
information to the network server, although it is not visible 60 
to or accessible by the data store, the end user, or any third 
party. 

Because the data store identification key is stored with the 
header information (and any logic information made avail- 
able by the data store), it is also automatically transferred to 65 
the client application when the end user accesses the header 
information remotely. The client application, therefore, can 
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use the identification key to distinguish among header and 
logic information received from different third party sources 
and from the end user. 

This distinction can be made tangible to the end user by 
representing header and logic information on the GUI in 
different colors or with a symbol used to identify the data 
store from which the information was accessed. Because the 
identification key is embedded in the information, it will also 
be included in any request that uses such information. 

A completed request will be addressed to visit the data 
stores that have information included in the request header, 
to automatically access raw data and logic at each data store 
according to the APL instructions, and to return a result, 
without the necessity of having any special interface for 
different data stores. Thus, a single request can perform 
numerous data retrieval and data processing operations at 
numerous data stores. 

For example, an end user might wish to household its own 
data with data stored at Data Store A to produce a result set 
for processing according to logic made available at Data 
Store B. This end user would electronically visit each data 
store to access the desired header and logic information. 
After accessing this information and integrating it on the 
GUI to build a customized request, the end user would attach 
its own data to the request. 

Depending upon the request constructed by the end user, 
the process would use the identification key associated with 
the header and logic information to address the request to the 
appropriate data stores in the appropriate order. In the 
example given above, the process would automatically 
address the request to Data Store A for householding and 
data retrieval and then to Data Store B for logic execution. 
Upon arrival at Data Store A, the householding request 
would be automatically executed and compressed raw data 
from Data Store A would be appended to the compressed end 
user data that was attached to the request and a new, 
compressed object would be created with header informa- 
tion corresponding to the new, householded raw data. The 
entire request would then be forwarded to Data Store B, 
where the request would call Data Store B logic for execu- 
tion on the entire compressed mass of householded data. The 
result set from this process would be formatted as specified 
by the end user and returned. 

The invention includes logic to prevent the request from 
failing at any particular data store because it lacks necessary 
information or logic stored at another data store. In a method 
according to the invention, the request will avoid such a 
"deadly embrace" by travelling from one data store to 
another to retrieve and store information necessary to 
execute the request until it has retrieved enough information 
to actually execute according to its instructions. In other 
words, the system will not halt simply because the request 
needs more data or logic before it can finally execute the 
request. 

REQUEST SECURITY 

The exemplary embodiment of the invention includes 
compression and transmission processes that provide secu- 
rity for all outgoing request objects and incoming results. In 
the preferred embodiment of the invention, compression is 
dynamic so that each byte constituting header information 
and/or raw data is translated differently each time it is 
compressed, and identical bytes of header information and 
raw data may be represented differently throughout the same 
compressed object. Further, each request object is transmit- 
ted in separate compressed data blocks. 
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A data store will not be permitted to open a request or a 
part of a request until it has received all of the data blocks 
comprising the request and presented an identification key 
that matches information stored in the header. The identifi- 
cation key itself is dynamically compressed and disbursed 5 
among the separately transmitted data blocks, so that no 
individual data block has the information necessary to open 
the block. Dynamic compression renders unidentifiable even 
that portion of the identification key included with an 
individual block. 10 

Upon the arrival of the request at the data store, the data 
store is able to access only a limited section of the header 
information transmitted with the data block, which section 
specifies the position of that block in the entire request. The 
data store will not be able to access additional header 15 
information or the raw data until it has received and 
assembled all of the pieces of the request. Thus, if a portion 
of the request is misdirected or intercepted, the contents of 
each compressed data block will be inaccessible. 

Once the entire request is received by the data store, the 20 
data object is assembled. After the object is assembled, the 
data store is permitted to search another limited section of 
the header information for unformatted information that 
matches the data store identification key. 

In the event that the data store does not find a match, the 
data object will not open and the instructions and raw data 
will remain inaccessible. In the event that the data store does 
find a match, the header information and raw data pertaining 
to that data store only will become accessible, and process- 30 
ing will begin according to the APL instructions. 

A request that is being circulated to multiple data stores 
will allow each data store to access only that part of the 
header and raw data that is necessary to execute the request 
instructions at that data store. In the example given above, 35 
therefore, Data Store A would execute a householding 
request, but it would not access or identify the logic to be 
executed at Data Store B. 

After the householding was completed, Data Store A 
would construct a compressed object with the householded 40 
Data Store A and end user data, create an appropriate header 
and send the new, compressed request object on to Data 
Store B. 

Because Data Store B requires access to the entire mass 
of raw data in order to execute specified logic, it will be 45 
permitted to access all of the attached, householded raw 
data, including the data added at Data Store A. 

Because the request contains logic that will instruct it to 
circulate among the different data stores gathering relevant 
information until a complete request can be executed, secu- 
rity will not prevent the execution of a multiple data store 
request, and a multiple data store request will not compro- 
mise the security of the data object. 

The invention may be embodied as a computer program 5S 
product. A computer program product includes a recording 
medium, such as a floppy disk, a high-capacity read only 
memory in the form of an optically read compact disc or 
CD-ROM, a tape, a transmission type media, such as a 
digital or analog communications link, or a similar computer 6Q 
program product. 

It will be apparent to one skilled in the art that the manner 
of making and using the claimed invention has been 
adequately disclosed in the above- written description of the 
preferred embodiment(s). 55 

It will be understood that the above described preferred 
embodiment(s) of the present invention are susceptible to 
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various modifications, changes, and adaptations, and the 
same are intended to be comprehended within the meaning 
and range of equivalents of the appended claims. 

Further, although a number of equivalent components 
may have been mentioned herein which could be used in 
place of the components described with reference to the 
preferred embodiments), this is not meant to be an exhaus- 
tive treatment of all the possible equivalents, nor to limit the 
invention defined by the claims to any particular equivalent 
or combination thereof. 

A person skilled in the art would realize that there may be 
other equivalent components presently known, or to be 
developed, which could be used within the spirit and scope 
of the invention defined by the claims. 

What is claimed is: 

1. A method of using execution tables, wherein a struc- 
tural search is performed comprising: 

analyzing all search patterns generated for a given input 
pattern and identifying the output table symbol for 
which an exact match is required most frequently in the 
same search pattern position; 

prioritizing the search patterns identified in the previous 
step to eliminate at the beginning of the process the 
largest number of records that will not match the 
parsing table and continuing to prioritize search pat- 
terns using the same process until all search patterns are 
in order; 

accessing input data according to the highest priority 
search pattern and determining whether the input data 
and parsing table, position determined by the highest 
priority search parameter, matches; 

eliminating subsequent searches according to whether a 
match in the accessing step occurred and whether such 
a match is required in subsequent searches; 

repeating accessing and eliminating steps until the input 
data matches the parsing table in relevant positions or 
until the search fails; 

moving the matched data into a virtual memory table to 
retrieve input data, position determined by position of 
next search parameter, without losing the location of 
the matched data for the previous position, and 

storing the frequency of matches of input data and parsing 
table data, position determined by search parameter, in 
a virtual memory table. 

2. The method of using execution tables according to 
claim 1, wherein a second execution table performs the input 
data matching search comprising: 

analyzing only the search patterns from the virtual 
memory table to match data input to the parsing table 
(s); and 

matching the input data to the parsing table in order of the 
position and end user preferences, starting with the 
record fields that require an exact match and proceed- 
ing to fields in which the end user specified some 
tolerance data in data. 

3. The method according to claim 2 wherein in the event 
that more than one field requires a match, the second 
execution table will examine the primary position first, 
followed by the secondary position. 

4. The method of using execution tables according to 
claim 1 wherein there are two or more positions selected by 
an end user, comprising: 

comparing the frequency of the primary position matches 
verses the frequency of secondary position matches; 
and 
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beginning with the position value that matched the fewest 
records, comparing the primary position value to the 
primary position and the secondary position value to 
the secondary position to assure both values match for 
records moved into virtual memory by the execution 5 
tables. 

5. The method of using execution tables according to 
claim 1, wherein after the first two positions are matched, 
any other required matching is performed only on the 
remaining records that were moved into virtual memory. 10 

6. The method of using execution tables according to 
claim 1, wherein a third execution table matches the input 
data to the parsing table for those fields that the end user 
selected as allowing for error tolerance, comprising: 
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accessing the data input for the first search pattern posi- 
tion in the first remaining search pattern for which 
deviation is permissible; 

retrieving the data value for this search pattern position; 

matching the data value to the parsing table; and 

applying the fault tolerance logic as defined by the end 
user; 

wherein a final, customized parsing search will contain: 

(i) a complete and standardized set of information 
matching the key parameters; and 

(ii) a complete and standardized set of information with 
the fault tolerance as defined by end user. 

***** 
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